Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

parser: account for number of header columns in dialect detection #1359

Merged
merged 1 commit into from
Jan 30, 2024

Conversation

psFried
Copy link
Member

@psFried psFried commented Jan 30, 2024

When we parse CSVs, we consider it an error for any row to contain more values than there are header columns. But the dialect detection wasn't consistent with that behavior, and if it encountered such a row it would score it higher than it would a row containing fewer values than there are headers. The consequence of that is that we could end up scoring an incorrect quote character higher than a correct one if it produces more columns (which often the case when quoted values contain delimiters). This commit addresses that oversight by zeroing the score of any row that contains too many values. Thus it is treated the same as if the row couldn't be parsed at all. The result is that dialect detection produces a much more accurate guess of the correct quote character.


This change is Reviewable

When we parse CSVs, we consider it an error for any row to contain more values
than there are header columns. But the dialect detection wasn't consistent with
that behavior, and if it encountered such a row it would score it higher than
it would a row containing fewer values than there are headers. The consequence
of that is that we could end up scoring an incorrect quote character higher
than a correct one if it produces more columns (which often the case when
quoted values contain delimiters). This commit addresses that oversight by
zeroing the score of any row that contains too many values. Thus it is treated
the same as if the row couldn't be parsed at all. The result is that dialect
detection produces a much more accurate guess of the correct quote character.
@@ -106,9 +119,14 @@ pub fn detect_dialect(
.pop()
.expect("must have at least one candidate dialect");
// Log the top few candidates, as it's helpful to see the runner up when detection doesn't go as we expected
let runners_up = &dialects[0..(dialects.len().min(3))];
let runners_up = &dialects[dialects.len().saturating_sub(3)..];
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous expression here was incorrect. So this now correctly prints the top 3 runners up, with the last value being the 2nd place finisher.

Copy link
Member

@williamhbaker williamhbaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@psFried psFried merged commit 9b390c5 into master Jan 30, 2024
3 checks passed
@psFried psFried deleted the phil/csv-detect-better branch January 30, 2024 20:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants